feat(pegboard): move pegboard serverless to wf #3416

abcxff · 2025-11-08T01:24:01Z

Fixes RVT-5269

vercel · 2025-11-08T01:24:05Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivetkit-serverless	Error			Nov 19, 2025 5:34pm

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Nov 19, 2025 5:34pm
rivet-inspector	Ignored	Preview	Nov 19, 2025 5:34pm
rivet-site	Ignored	Preview	Nov 19, 2025 5:34pm

abcxff · 2025-11-08T01:24:17Z

fix(serverless): general cleanup #3450
feat(pegboard): move pegboard serverless to wf #3416 👈 (View in Graphite)
main

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

linear · 2025-11-08T01:25:16Z

RVT-5269 Move serverless on to wf (gas) engine

pkg-pr-new · 2025-11-08T01:26:15Z

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3416

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3416

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3416

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3416

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3416

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3416

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3416

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3416

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3416

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3416

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3416

commit: 9db0849

claude · 2025-11-08T01:28:15Z

PR Review: Move Pegboard Serverless to Workflow

Summary

This PR migrates pegboard serverless functionality from a separate package into the main pegboard package as workflows, and removes the unused internal package. The serverless runner lifecycle is now managed via the gasoline workflow engine.

Critical Issues

1. Corrupted Error Artifact Generation (BLOCKER)

The PR includes 88 malformed error artifact files with corrupted names containing autocomplete markers like raCompletionMarker and partial error codes.

Examples:

serverless_runner.invalid_configraCompletionMarker.json
seraCompletionMarkerrunner_config.invalid.json
serverless_runner.cnot_found.json

Root Cause: The error definitions in errors.rs are correct, but the artifact generation script has a bug creating incremental artifacts.

Action Required:

Delete all malformed error artifact files
Fix the error artifact generation script
Regenerate only valid error artifacts

Code Quality Issues

2. Unnecessary field name in pool.rs:35
Line has redundant field initialization: config: config
Should be: config

3. Debug comment in runtime.rs:1
Leftover comment: runner wf see how signal fail handling

4. Activity Retry Strategy (runner.rs:115-116)
Using MAX values for timeout and retries is concerning. Infinite retries could cause zombie workflows. Is this intentional? Please add a comment explaining the rationale or set reasonable limits.

5. TODO about lifecycle state (runner.rs:229-234)
Known issue about lifecycle state management when connections close unexpectedly. Consider creating a GitHub issue to track this.

Positive Changes

Architecture Simplification

Consolidates serverless into main pegboard package
Reduces workspace complexity (removed 2 packages)
Improves code cohesion

Workflow-Based State Management

Provides durable state persistence
Better observability
Automatic retry and recovery

Security

Headers properly validated preventing injection attacks
Tokens handled securely

ActivityCtx Enhancements
Added useful methods: find_workflow, get_workflows, state accessor, update_workflow_tags

Test Coverage

No test files found for new serverless workflows. Consider adding:

Unit tests for serverless configuration parsing
Integration tests for pool scaling logic
Tests for graceful draining behavior

Recommendations

CRITICAL: Fix error artifact generation bug
HIGH: Document or fix infinite timeout/retry strategy
MEDIUM: Add tests for serverless workflows
LOW: Clean up debug comment and use shortened field init

Final Verdict

NEEDS CHANGES - Good architecture and code quality, but corrupted error artifacts must be fixed before merging. Once resolved, this is a solid refactoring.

engine/packages/pegboard/src/workflows/actor/mod.rs

engine/packages/pegboard/src/workflows/actor/destroy.rs

engine/packages/pegboard/src/workflows/actor/runtime.rs

engine/packages/engine/src/run_config.rs

engine/packages/api-peer/src/internal.rs

MasterPtato · 2025-11-11T23:52:38Z

engine/packages/pegboard/src/workflows/serverless/connection.rs

+#[timeout = u64::MAX]
+#[max_retries = usize::MAX]


I was wrong about using this for outbound req retries. The issue is that the backoff never resets, so it gets longer and longer after each request.

Ideally we only have an increasing backoff for actual errors and not the request stopping normally or failing after drain.

This will require a loop with state tracking last retry and retry count and a resetting mechanism (see pb actor reschedule_actor fn for reference)

Tracking with https://linear.app/rivet-gg/issue/RVT-5274/add-backoff-to-pb-serverless
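
For reference, the loop state described above could look roughly like the following. This is only a sketch: `BackoffState`, its fields, and the three constants are hypothetical names and values chosen for illustration, and it is not based on the actual `reschedule_actor` implementation.

```rust
use std::time::Duration;

/// Hypothetical retry state; field names are illustrative, not from the PR.
#[derive(Debug, Default)]
struct BackoffState {
    /// Consecutive failed attempts since the last reset.
    retry_count: u32,
    /// Timestamp (ms) of the most recent failure, if any.
    last_retry_ts: Option<u64>,
}

// Assumed tuning values, not the engine's real configuration.
const BASE_DELAY_MS: u64 = 100;
const MAX_EXPONENT: u32 = 8;
const RESET_AFTER_MS: u64 = 60_000;

impl BackoffState {
    /// Record an actual error at `now_ms` and return the delay before the next attempt.
    fn on_error(&mut self, now_ms: u64) -> Duration {
        // Forget old failures if the previous one was long enough ago.
        if let Some(last) = self.last_retry_ts {
            if now_ms.saturating_sub(last) >= RESET_AFTER_MS {
                self.retry_count = 0;
            }
        }
        // Cap the exponent so the delay stops growing after MAX_EXPONENT doublings.
        let exp = self.retry_count.min(MAX_EXPONENT);
        self.retry_count = self.retry_count.saturating_add(1);
        self.last_retry_ts = Some(now_ms);
        Duration::from_millis(BASE_DELAY_MS << exp)
    }

    /// A request that stopped normally (or failed after drain) clears the backoff.
    fn on_normal_stop(&mut self) {
        *self = Self::default();
    }
}
```

With these assumed values, three consecutive errors would wait 100 ms, 200 ms, and 400 ms, while a normal stop or a quiet period of `RESET_AFTER_MS` brings the next delay back to 100 ms.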

claude · 2025-11-12T20:48:27Z

PR Review: Move Pegboard Serverless to Workflow

Summary

This PR successfully migrates the pegboard serverless autoscaler from a standalone service to a workflow-based architecture. The refactoring removes ~524 lines from pegboard-serverless package and integrates the functionality into the pegboard workflows using the Gasoline workflow engine.

Code Quality & Architecture

Strengths

Clean Separation of Concerns: The new implementation splits responsibilities across three well-defined workflows (pool.rs, runner.rs, connection.rs)
Improved Reliability: Moving to workflows provides durable state management, automatic retries, and better observability
Package Consolidation: Removes two unused packages
Error Handling: Proper structured errors using the custom RivetError system
API Modernization: The bump autoscaler endpoint now uses proper signal-based workflow communication

Areas for Improvement

1. Critical: Unclosed Event Source on Early Exit (connection.rs:196-233)

When the stream ends early, the source EventSource is not properly closed before returning. This can lead to resource leaks.
Recommendation: Call source.close() before returning in the StreamEnded error case.

2. Potential Race Condition (connection.rs:241-262)

The code attempts to send RunnerDrainStarted signal inline with bypass_signal_from_workflow_I_KNOW_WHAT_IM_DOING(), but if it fails, relies on the workflow to send it durably.
Question: Is there a race where both the activity retry and workflow durable send could succeed?

3. TODO Comment About Lifecycle State (connection.rs:225-232)

Known issue where unexpected connection closures before receiving runner_id can cause lifecycle state problems.
Recommendation: Create tracking issue or implement timeout-based cleanup.

4. Gasoline API Extension

The PR adds signal(), find_workflow(), and get_workflows() methods to ActivityCtx. Activities sending signals breaks typical workflow orchestration patterns.
Question: Is this a deliberate architectural decision? What are the implications for workflow determinism?

5. Infinite Retry Configuration (connection.rs:72-73)

Activity has max_retries = usize::MAX and timeout = u64::MAX without documentation.
Recommendation: Add comment explaining why infinite retries are safe.

Security Considerations

Good: Comprehensive header validation, URL validation, proper base64 error handling

Concern: Admin token added to headers - ensure TLS-only and trusted endpoints. Response bodies are included in error messages (though properly limited to 512 bytes).

Performance

Good: Efficient streaming queries, concurrent workflow operations, graceful draining

Opportunity: Consider aggressive caching for high-frequency scaling, batch signaling for runner draining

Test Coverage

Missing: No new test files for workflow implementations.
Recommendation: Add unit tests for autoscaling logic, integration tests for state transitions, error handling tests, lifecycle tests

Summary

This is a well-architected refactoring that improves maintainability and reliability. Main concerns are resource cleanup and activities sending signals.

Recommendation: Request changes to address EventSource leak, then approve after review.

High Priority

Fix EventSource resource leak on early exit
Document lifecycle state TODO
Clarify design decision on activities sending signals

Medium Priority

Add test coverage
Document infinite retry strategy
Add runner count validation

Generated with Claude Code

engine/packages/api-peer/src/internal.rs

claude · 2025-11-19T17:36:03Z

PR Review: Move Pegboard Serverless to Workflow

Overview

This PR successfully migrates the serverless connection management from a service-based approach to a workflow-based architecture. The refactoring removes the pegboard-serverless and internal packages and consolidates the logic into three new workflows within the pegboard package.

✅ Strengths

Architecture Improvements

Better separation of concerns: The three-workflow structure (pool → runner → connection) provides clear boundaries and responsibilities
Workflow-based lifecycle: Moving to workflows provides better durability, observability, and state management compared to the previous service approach
Proper cleanup: Removed dead code (internal package, unused operations) reducing maintenance burden

Code Quality

Good error handling: The code properly handles edge cases like missing runner configs, invalid status codes, and workflow not found scenarios (connection.rs:459-470)
Exponential backoff: Implements proper retry logic with configurable parameters and reset mechanisms (connection.rs:499-505)
Structured logging: Uses tracing correctly with structured fields rather than formatted strings

Configuration Management

New configuration options: Added sensible defaults for serverless backoff behavior in config/pegboard.rs:59-110
Validation: Proper validation of serverless config including URL parsing and header limits (ops/runner_config/upsert.rs:102-136)

⚠️ Issues & Concerns

Critical Issues

Race condition in pool scaling (pool.rs:23-30, 43-56)
```
// Remove completed runners
state.runners.retain(|r| !completed_runners.contains(r));

// Calculate drain/start counts
let drain_count = state.runners.len().saturating_sub(desired_count);
```
If runners complete between getting completed_runners and draining, the count calculations could be incorrect. Consider recalculating after removal.
Unhandled signal errors (connection.rs:335-356)
The code catches signal send failures and logs them, but returns success with send_drain_started: true. If the workflow-based signal also fails, the parent pool may never know the runner drained. Consider more robust error handling.
Missing timeout on activity (connection.rs:369-379)
is_runner_draining has no timeout, but fetches workflow state. If the database is slow, this could hang indefinitely. Consider adding a timeout.
TODO: Critical lifecycle issue (connection.rs:319-327)
```
// TODO:
// For unexpected closes, we don't know if the runner connected
// or not bc we can't correlate the runner id.
//
// Lifecycle state falls apart
```
This is a significant concern. If the SSE stream closes unexpectedly before receiving the runner ID, the system cannot properly track the runner's lifecycle. This could lead to resource leaks or orphaned runners.
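
To illustrate the pool scaling point above, one way to keep the counts consistent is to derive both of them from the runner list only after completed runners have been removed. The sketch below is an assumption-laden illustration, not the PR's code: `plan_scaling` is a hypothetical helper, `u64` stands in for the engine's `Id` type, and whether the race actually occurs in pool.rs has not been verified here.

```rust
/// Hypothetical helper: prune completed runners first, then derive both
/// counts from the pruned list so they cannot disagree with it.
/// `u64` is a stand-in for the engine's `Id` type.
fn plan_scaling(
    runners: &mut Vec<u64>,
    completed: &[u64],
    desired_count: usize,
) -> (usize, usize) {
    // Remove completed runners
    runners.retain(|r| !completed.contains(r));

    // Both counts are computed from the same post-removal length.
    let drain_count = runners.len().saturating_sub(desired_count);
    let start_count = desired_count.saturating_sub(runners.len());
    (drain_count, start_count)
}
```

At most one of the two returned counts is non-zero, since they are the two saturating differences of the same pair of values.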

Security Concerns

Header injection risk (connection.rs:225-258)
While headers are validated during upsert (ops/runner_config/upsert.rs:109-136), the error handling silently filters out invalid headers at runtime:
```
.flat_map(|(k, v)| {
    // NOTE: This will filter out invalid headers without warning
    Some((
        k.parse::<HeaderName>().ok()?,
        v.parse::<HeaderValue>().ok()?,
    ))
})
```
This could hide configuration errors. Consider logging warnings for filtered headers.
Admin token exposure (connection.rs:216-223)
The admin token is sent in headers to external serverless endpoints. Ensure these endpoints are trusted and connections are secure (HTTPS enforced).
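
As a rough illustration of the header filtering suggestion above, the filter could collect rejected header names so they can be logged instead of silently dropped. This sketch uses only the standard library: `is_plausible_header` is a simplified, hypothetical stand-in for the real `HeaderName`/`HeaderValue` parsing, and `eprintln!` stands in for a structured `tracing` warning.

```rust
/// Simplified stand-in for `HeaderName`/`HeaderValue` parsing. Assumption:
/// the real rules are whatever the HTTP library in use enforces.
fn is_plausible_header(name: &str, value: &str) -> bool {
    let name_ok = !name.is_empty()
        && name
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'_');
    // Allow tab and visible bytes, reject other control characters.
    let value_ok = value
        .bytes()
        .all(|b| b == b'\t' || (b >= 0x20 && b != 0x7f));
    name_ok && value_ok
}

/// Split headers into accepted pairs and the names of rejected ones,
/// warning about each rejection rather than dropping it silently.
fn split_headers(headers: Vec<(String, String)>) -> (Vec<(String, String)>, Vec<String>) {
    let mut accepted = Vec::new();
    let mut rejected = Vec::new();
    for (name, value) in headers {
        if is_plausible_header(&name, &value) {
            accepted.push((name, value));
        } else {
            // Stand-in for a structured warning log
            eprintln!("dropping invalid header: {name:?}");
            rejected.push(name);
        }
    }
    (accepted, rejected)
}
```

Only the header name is logged here, on the assumption that values may carry secrets such as tokens.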

Code Quality Issues

Deprecated header still sent (connection.rs:251-255)
```
// Deprecated
(
    HeaderName::from_static("x-rivet-namespace-id"),
    HeaderValue::try_from(namespace.name)?,
)
```
If this is deprecated, consider removing it or documenting the deprecation timeline.
Magic constants (connection.rs:23)
```
const DRAIN_GRACE_PERIOD: Duration = Duration::from_secs(5);
```
This should be configurable via the config system like other timeouts.
Unbounded workflow state growth (pool.rs:14-16)
```
struct LifecycleState {
    runners: Vec<Id>,
}
```
The runners vec grows without bounds. With autoscaling, this could become large. Consider pagination or a different tracking mechanism for large deployments.

Testing Concerns

No unit tests: No test files were found in the pegboard package for the new workflow code
No integration tests visible: The PR doesn't show test coverage for the new workflow interactions
Complex state machine: The connection retry logic and lifecycle management are complex but untested

Minor Issues

Potential integer overflow (pool.rs:173-178)

```
let desired_count = (runners_margin
    + (adjusted_desired_slots as u32).div_ceil(slots_per_runner))
    .max(min_runners)
    .min(max_runners)
    .try_into()?;
```

The comment says these are in u32 range, but there's no validation. Consider explicit bounds checking.

Unclear comment (connection.rs:339)
```
// This is ok, because we only send DrainStarted once
.bypass_signal_from_workflow_I_KNOW_WHAT_IM_DOING()
```
The bypass is used, but the rationale could be clearer. What happens if it's sent multiple times?
Inconsistent error handling (runner.rs:52-54)
The runner workflow waits for the connection workflow to complete, but doesn't check if it succeeded or failed. Consider handling workflow errors.
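
For the integer conversion noted under pool.rs:173-178, explicit bounds checking might look like the sketch below. The function name and parameter types are assumptions (in particular, `adjusted_desired_slots` is taken to be a `u64` here); the clamp order mirrors the quoted snippet.

```rust
/// Hypothetical, explicitly bounds-checked version of the computation.
/// Returns `None` on overflow or when `slots_per_runner` is zero.
fn desired_runner_count(
    runners_margin: u32,
    adjusted_desired_slots: u64, // assumed type; the PR's actual type may differ
    slots_per_runner: u32,
    min_runners: u32,
    max_runners: u32,
) -> Option<usize> {
    // Reject values outside the u32 range instead of truncating with `as`.
    let slots = u32::try_from(adjusted_desired_slots).ok()?;
    if slots_per_runner == 0 {
        return None;
    }
    let needed = slots.div_ceil(slots_per_runner);
    // `checked_add` surfaces overflow instead of wrapping or panicking.
    let total = runners_margin.checked_add(needed)?;
    usize::try_from(total.max(min_runners).min(max_runners)).ok()
}
```

Returning `Option` keeps the sketch dependency-free; in the workflow itself the failure cases would presumably map to a proper error instead.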

📋 Recommendations

High Priority

Add comprehensive tests for the new workflows, especially:
- Connection retry logic and backoff behavior
- Pool scaling edge cases
- Drain signal handling
- Error recovery scenarios
Resolve the TODO at connection.rs:319-327 regarding lifecycle tracking for unexpected closes
Add timeouts to all activities that make external calls or database queries
Document the workflow interaction patterns and state machines

Medium Priority

Make DRAIN_GRACE_PERIOD configurable
Add warnings for filtered invalid headers
Consider adding metrics/alerts for:
- Connection retry counts
- Failed signal deliveries
- Unexpected stream closes

Low Priority

Remove deprecated headers or document the timeline
Add bounds checking for integer conversions
Consider implementing pagination for large runner lists in pool state

🎯 Overall Assessment

This is a solid architectural improvement that moves serverless connection management to a more maintainable workflow-based approach. The code quality is generally good with proper error handling and structured logging. However, the lack of tests for such critical infrastructure code is concerning, and there are several edge cases (race conditions, signal failures, lifecycle tracking) that need addressing before this can be considered production-ready.

Recommendation: Request changes to address the critical issues, especially the lifecycle tracking TODO and test coverage, before merging.

Note: Review based on repository conventions in CLAUDE.md

abcxff changed the title ~~feat: pegboard serverless pool~~ feat(pegboard): move pegboard serverless to wf Nov 8, 2025

abcxff requested a review from MasterPtato November 8, 2025 01:25

abcxff marked this pull request as ready for review November 8, 2025 01:25

vercel bot deployed to Preview – rivetkit-serverless November 8, 2025 01:25 View deployment

vercel bot deployed to Preview – rivet-site November 8, 2025 01:27 View deployment

abcxff marked this pull request as draft November 8, 2025 02:08

abcxff force-pushed the 11-05-feat_pegboard_serverless_pool branch from 250de22 to ddb0414 Compare November 10, 2025 05:15

abcxff marked this pull request as ready for review November 10, 2025 05:15

vercel bot deployed to Preview – rivetkit-serverless November 10, 2025 05:16 View deployment

vercel bot deployed to Preview – rivet-site November 10, 2025 05:18 View deployment

MasterPtato force-pushed the 11-05-feat_pegboard_serverless_pool branch from ddb0414 to be0a1d2 Compare November 11, 2025 23:41

MasterPtato requested changes Nov 11, 2025

View reviewed changes

MasterPtato mentioned this pull request Nov 11, 2025

fix(serverless): general cleanup #3450

Open

vercel bot deployed to Preview – rivetkit-serverless November 11, 2025 23:42 View deployment

vercel bot deployed to Preview – rivet-site November 11, 2025 23:44 View deployment

MasterPtato requested changes Nov 11, 2025

View reviewed changes

vercel bot deployed to Preview – rivetkit-serverless November 12, 2025 20:46 View deployment

abcxff requested a review from MasterPtato November 12, 2025 23:20

MasterPtato requested changes Nov 13, 2025

View reviewed changes

engine/packages/api-peer/src/internal.rs (outdated)

abcxff and others added 2 commits November 19, 2025 17:20

feat: pegboard serverless pool

432ed66

fix(serverless): general cleanup

1955f5d

abcxff force-pushed the 11-05-feat_pegboard_serverless_pool branch from 9db0849 to 1955f5d Compare November 19, 2025 17:34

vercel bot had a problem deploying to Preview – rivetkit-serverless November 19, 2025 17:34 Failure

abcxff requested a review from MasterPtato November 19, 2025 17:53


3 participants
